AITopics | human evaluator

Collaborating Authors

human evaluator

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

Neural Information Processing SystemsMar-22-2026, 10:38:48 GMT

To evaluate Large Language Models (LLMs) for question answering (QA), traditional methods typically focus on directly assessing the immediate responses generated by the models based on the given question and context. In the common use case of humans seeking AI assistant's help in finding information, these non-interactive evaluations do not account for the dynamic nature of human-model conversations, and interaction-aware evaluations have shown that accurate models are not necessarily preferred by humans Lee et al. Recent works in human-computer interaction (HCI) have employed human evaluators to conduct interactions and evaluations, but they are often prohibitively expensive and time-consuming to scale. In this work, we introduce an automated evaluation framework IQA-EVAL to Interactive Question Answering Evaluations, more specifically, we introduce LLM-based Evaluation Agent (LEA) that can: (1) simulate human behaviors to generate interactions with IQA models; (2) automatically evaluate the generated interactions. Moreover, we propose assigning personas to LEAs to better simulate groups of real human evaluators. We show that: (1) our evaluation framework with GPT-4 (or Claude) as the backbone model achieves a high correlation with human evaluations on the IQA task; (2) assigning personas to LEA to better represent the crowd further significantly improves correlations. Finally, we use our automated metric to evaluate five recent LLMs with over 1000 questions from complex and ambiguous question answering tasks, which would cost $5k if evaluated by humans.

artificial intelligence, large language model, natural language, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

9c3563bbeb2ad7f3b3b8ed0fcd3b440f-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-16-2026, 23:02:33 GMT

category, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Country: Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)

Add feedback

Evaluation of the phi-3-mini SLM for identification of texts related to medicine, health, and sports injuries

Brogly, Chris, Rjaibi, Saif, Liang, Charlotte, Lam, Erica, Wang, Edward, Levitan, Adam, Paleczny, Sarah, Cusimano, Michael

arXiv.org Artificial IntelligenceNov-21-2025

Small Language Models (SLMs) have potential to be used for automatically labelling and identifying aspects of text data for medicine/health-related purposes from documents and the web. As their resource requirements are significantly lower than Large Language Models (LLMs), these can be deployed potentially on more types of devices. SLMs often are benchmarked on health/medicine-related tasks, such as MedQA, although performance on these can vary especially depending on the size of the model in terms of number of parameters. Furthermore, these test results may not necessarily reflect real-world performance regarding the automatic labelling or identification of texts in documents and the web. As a result, we compared topic-relatedness scores from Microsofts phi-3-mini-4k-instruct SLM to the topic-relatedness scores from 7 human evaluators on 1144 samples of medical/health-related texts and 1117 samples of sports injury-related texts. These texts were from a larger dataset of about 9 million news headlines, each of which were processed and assigned scores by phi-3-mini-4k-instruct. Our sample was selected (filtered) based on 1 (low filtering) or more (high filtering) Boolean conditions on the phi-3 SLM scores. We found low-moderate significant correlations between the scores from the SLM and human evaluators for sports injury texts with low filtering (\r{ho} = 0.3413, p < 0.001) and medicine/health texts with high filtering (\r{ho} = 0.3854, p < 0.001), and low significant correlation for medicine/health texts with low filtering (\r{ho} = 0.2255, p < 0.001). There was negligible, insignificant correlation for sports injury-related texts with high filtering (\r{ho} = 0.0318, p = 0.4466).

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICMI65310.2025.11141224

2504.08764

Country: North America > Canada > Ontario (0.16)

Genre: Research Report > New Finding (0.88)

Industry:

Media > News (0.46)
Health & Medicine > Consumer Health (0.30)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Add feedback

Comparing human and LLM politeness strategies in free production

Zhao, Haoran, Hawkins, Robert D.

arXiv.org Artificial IntelligenceOct-31-2025

Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals -- from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models ($\ge$70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.

artificial intelligence, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.09391

Country: North America > United States > California (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

9c3563bbeb2ad7f3b3b8ed0fcd3b440f-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsOct-10-2025, 11:09:53 GMT

category, evaluation, gpt-4o, (16 more...)

Neural Information Processing Systems

Country: Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)

Add feedback

Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Ji, Zhengran, Chen, Boyuan

arXiv.org Artificial IntelligenceOct-8-2025

Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.

machine learning, reinforcement learning, reward model, (14 more...)

arXiv.org Artificial Intelligence

2508.07126

Genre: Research Report > New Finding (0.93)

Industry: Education > Educational Setting > Online (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback

Juzek, Tom S., Ward, Zina B.

arXiv.org Artificial IntelligenceAug-5-2025

Large Language Models (LLMs) are known to overuse certain terms like "delve" and "intricate." The exact reasons for these lexical choices, however, have been unclear. Using Meta's Llama model, this study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting the lexical preferences of LLMs that are potentially LHF-induced. Next, we more conclusively link LHF to lexical overuse by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants that include certain words. This lexical overuse can be seen as a sort of misalignment, though our study highlights the potential divergence between the lexical expectations of different populations -- namely LHF workers versus LLM users. Our work contributes to the growing body of research on explainable artificial intelligence and emphasizes the importance of both data and procedural transparency in alignment research.

large language model, machine learning, variant, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.17605/OSF.IO/JY48S

2508.0193

Country:

Africa (1.00)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation

Zhong, Ruican, McDonald, David W., Hsieh, Gary

arXiv.org Artificial IntelligenceJul-4-2025

Usability evaluation is crucial in human-centered design but can be costly, requiring expert time and user compensation. In this work, we developed a method for synthetic heuristic evaluation using multimodal LLMs' ability to analyze images and provide design feedback. Comparing our synthetic evaluations to those by experienced UX practitioners across two apps, we found our evaluation identified 73% and 77% of usability issues, which exceeded the performance of 5 experienced human evaluators (57% and 63%). Compared to human evaluators, the synthetic evaluation's performance maintained consistent performance across tasks and excelled in detecting layout issues, highlighting potential attentional and perceptual strengths of synthetic evaluation. However, synthetic evaluation struggled with recognizing some UI components and design conventions, as well as identifying across screen violations. Additionally, testing synthetic evaluations over time and accounts revealed stable performance. Overall, our work highlights the performance differences between human and LLM-driven evaluations, informing the design of synthetic heuristic evaluations.

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2507.02306

Country:

North America > United States > Washington > King County > Seattle (0.14)
Oceania > Australia > Queensland (0.04)
North America > United States > Virginia > Alexandria County > Alexandria (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.48)
Government (0.46)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress

Proietti, Lorenzo, Perrella, Stefano, Navigli, Roberto

arXiv.org Artificial IntelligenceJun-25-2025

In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to measure progress in the field, fostering discussion on an issue that we believe is crucial to the entire MT evaluation community.

artificial intelligence, computational linguistic, natural language, (13 more...)

arXiv.org Artificial Intelligence

2506.19571

Country:

Europe (1.00)
North America > United States (0.94)
Asia > Middle East > UAE (0.15)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Audio-Aware Large Language Models as Judges for Speaking Styles

Chiang, Cheng-Han, Wang, Xiaofei, Lin, Chung-Ching, Lin, Kevin, Li, Linjie, Kopetz, Radu, Qian, Yao, Wang, Zhendong, Yang, Zhengyuan, Lee, Hung-yi, Wang, Lijuan

arXiv.org Artificial IntelligenceJun-9-2025

Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as an automatic judge to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by SLMs on two tasks: voice style instruction following and role-playing. The speaking style we consider includes emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as a judge to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.05984

Country: Asia > Japan (0.28)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback